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Abstract 

Consider a class of decomposable combinatorial structures, using different types of atoms Z ~ 
{Zi, . . . , We address the random generation of such structures with respect to a size n and 

a targeted distribution in k of its distinguished atoms. We consider two variations on this problem. 

In the first alternative, the targeted distribution is given by k real numbers ^i, . . . ,fik such 
that < /ii < 1 for all i and fJ,i + ■ • • + fik < 1- We aim to generate random structures among 
the whole set of structures of a given size n, in such a way that the expected frequency of any 
distinguished atom Zi equals /i^. We address this problem by weighting the atoms with a fc-tuple 
TT of real- valued weights, inducing a weighted distribution over the set of structures of size n. We 
first adapt the classical recursive random generation scheme into an algorithm taking C'(n-'^+°^^) + 
mn log n) arithmetic operations to draw m structures from the 7r-weighted distribution. Secondly, 
we address the analytical computation of weights such that the targeted frequencies are achieved 
asymptotically, i. e. for large values of n. We derive systems of functional equations whose 
resolution gives an explicit relationship between tt and ^i, . . . , ^k- Lastly, we give an algorithm in 
0{kn^) for the inverse problem, i.e. computing the frequencies associated with a given fc-tuple tt 
of weights, and an optimized version in 0{kn?') in the case of context-free languages. This allows 
for a heuristic resolution of the weights/frequencies relationship suitable for complex specifications. 

In the second alternative, the targeted distribution is given by a fc natural numbers ni , . . . , 
such that ni + - ■ ■ + nk + r — n where r > is the number of undistinguished atoms. The structures 
must be generated uniformly among the set of structures of size n that contain exactly rii atoms 
Zi (1 < i < fc). We give a 0{r^ IliLi + mnklogn) algorithm for generating m structures, which 
simplifies into a C'C^'IliLi ^i + fnn) for regular specifications. 



1. Introduction 

The problem of uniform random generation of combinatorial structures has been extensively 
studied in the past few years. Notably, the wide class of decomposable structures, that is combi- 
natorial structures that can be constructed recursively in an unambiguous way, has been subject 
to great attention. Two general methods have been developed for the uniform generation of these 
structures: the recursive method [3| and, more recently, the so-called Boltzmann method [ii]. In 
the present paper, we generalize this problem to the problem of generating combinatorial struc- 
tures according to a given (non uniform) distribution. The distribution is defined by the desired 
frequencies of some given atoms in the structures that are generated. 

According to decomposable structures are defined by combinatorial specifications. Briefly, 
a combinatorial specification of a given class C of combinatorial structures is a tuple C of com- 
binatorial classes which are interrelated by means of productions made from basic objects of size 
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zero (empty structures) or size one (atoms), and from constructions (+ for disjoint union, x for 
products, sequence for sequences, set for multisets and cycle for directed cycles). 

We are interested in the following problem. Let C be a combinatorial class, whose set of atoms 
is 2 — {Zi, . . . , Z|^|}. Let us distinguish k < \Z\ atoms in Z, say Zi, . . . Zk- Now let n be 
an integer, and let us denote C„ the set of structures of C of length n. The problem consists in 
generating random structures in C„ while respecting a distribution of the k distinguished atoms. 
We consider two variations of the problem: 

1. Generation according to expected frequencies. The targeted distribution is given by k real 
numbers /ii, . . . , /i^ such that < fXi < 1 for all i and /ii + • ■ ■ + /ifc < 1. The structures must 
respect on the average the given frequency k-tuple. More precisely, we generate structures at 
random in such a way that 

(a) any structure of C„ has a positive probability to be generated; 

(b) for any i £ {1, . . . , fc}, the expected frequence of occurrences of Zi in the structures is 
equal to : if P(s) is the probability of the structure s to be generated by the algorithm, 
we must have X]sgc„ = '^Mi ! 

(c) two structures having the same distribution of the k distinguished atoms have the same 
probability of being generated. 

2. Generation according to exact frequencies. Here the distribution is given by k natural numbers 
ni, . . . , rife such that ni + ri2 + • ■ ■ rifc < n. The distribution of the number of distinguished 
atoms of any structure must respect the given k-tuple exactly. In other words, we generate 
structures uniformly at random in a subset of C„ constituted of all the structures s e C such 
that \s\zi — rii for alH € {1, . . . , fc}, where \s\zi stands for the number of atoms Zi in s. 

The above two problems arise when one tries to model naturally occurring objects or to circum- 
vent some limitations of generative descriptions, therefore both were addressed under fairly specific 
settings. For instance, a non-uniform scheme was used by Brlek et al [3| to perform a generation of 
generalized Motzkin paths according to their area. The generation according to exact frequencies 
was implicitly used in where the problem of randomly generating structures while fixing more 
than one parameter was addressed. One also needs to mention a very elegant Q{n) algorithm for 
generating words from regular languages with two types of atoms 6]. Finally, the original pre- 
sentation of the recent Boltzmann method Q features the generation of adsorbing staircase walks 
according to both the size and number of contacts to the origin. 

Our approach is based on the recursive method, which was initiated by Nijenhuis and Wilf j3|, 
and then generalized and formalized by Flajolet, Zimmermann and Van Cutsem [H. Section E] IS 
devoted to a short presentation of this methodology in the classical context of uniform generation. 
In Section[31 we focus on generating structures according to expected frequencies, with an emphasis 
on the computation of suitable weights. Finally, we present in Section |4] another algorithm which 
allows to generate structures according to exact frequencies. 

2. Combinatorial specifications and uniform generation 

As seen above, a combinatorial specification of a given class C of combinatorial structures is 
a tuple of classes which are interrelated by means of productions made from basic objects (empty 
structures denoted e and atoms, of size and 1 respectively) and from constructions {+ for disjoint 
union, x for products, sequence for sequences, set for multisets and cycle for directed cycles). 

The algorithm works as follows: First translate the specification into a standard one, where 
all products are binary, and the sequence, set, cycle constructions have been replaced with 
the marking and unmarking constructions Q and Q^^ (see p,]). Then the standard specification 
translates directly into procedures for counting the number of structures of a given size generated 
from a given non-terminal (see Table [Ij, or for generating one such object uniformly at random (see 
Table[5]). The computation of all tables up to size n requires 0{n'^) operations on coefficients, which 
can be lowered to 0(n(logn)^ loglogri) by using Joris van der Hoeven's technique for computing 
the coefficients fs*]. Then one random generation needs 0{n\ogn) operations in the worst case 
using the boustrophedonic method. These complexities can be lowered for some particular classes 
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C ^ Ax B ^ 

Table 1: Counting procedures for standard specifications. 
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Case: C = A x B. 

gC := procedure(n: integer); 
[/:=Uniform([0,l]); 
fc := 0; 

5 := ao6„/c„; 
while [/ > S* do 
:= fc + 1; 
5 5 + akbn-k/cn, 
Return((gA(fc),gB(n - fc))) 
end. 



(4) 
(5) 
(6) 



Case: C = 1. 

gC :— procedure(n: integer); 

if n = then Return(l) 
end. 

Case: C — Z. 

gC :— procedure(n: integer); 

if n = 1 then Return(Z) 
end. 

Case: C = A + B. 

gC :— procedure(ri: integer); 
[/:=Uniforni([0, 1]); 
iiU < a„/c„ 

then Return(gA(n)) 
else Return(gB(n)) 

end. 

Table 2: Uniform random generation procedures for standard specifications. The straightforward pointing and 
unpointing cases are omitted. 



of combinatorial structures, notably those that give rise to holonomic generating functions, so 
that the counting sequences satisfy linear recurrences leading to 0(n) operations only for 

computing the tables. This is the case for context-free specifications for example 

The integer coefficients used in the algorithm usually have an exponential growth with respect 
to the size n: 0{nlogn) in the labelled case and 0{n) in the unlabelled case Therefore, with 
Schonhage's multiplication algorithm fl3| for integer arithmetic or Fiirer's recent improvement 13 1, 
the precomputation and the generation have bit-complexity 0(71^+°*^^^). Meanwhile, using adap- 
tative floating point computations, the bit-complexity of the generation step can be lowered to 
0{n^^°^^^) Furthermore, combining 14| and the later work in Q leads to a precomputation 
step in 0{n^~^°^^'^) bit-complexity too. 



Another work extends this approach to unlabeled objects 15|. From now on, we suppose 
we are given an unlabeled standard specification, with union, product, marking and unmarking 
constructions. Tables [1] and [21 respectively, summarize the counting and generating procedures. 
The labeled case is very similar, with additional binomial coefficients. 



3. Generation according to expected frequencies 

3.1. Weighted combinatorial structures and random generation 

In this section, we consider the problem of generating structures of C„ at random in such a 
way that each structure s is generated with positive probability P(s), and the k-tuple of expected 
frequencies of the atoms Zi, . . . , Zk equals the given k-tuple (/ii, . . . , fik). Formally: 

P(s) > Vs e C„ (7) 

and 

^ \s\zMs) = nfi, Vi € {1, 2, . . . , k}. (8) 

sec„ 
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Moreover, any two structures (s, s') G C„ x C„ having the same distribution in atoms Zi, . . . , Zk 
must be equally generated: 

{\s\z^ = W\z^ e {1, . . . , k}) ^ P(s) = P(s'). (9) 

Our method consists in adjoining a fc-tuple of weights tt = (vri, . . . , tt^) to the specification, assigning 
a real- valued weight iTi e M.*^_ to each distinguished atom Zi € 2. The weight of any combinatorial 
structure is then defined to be the product of the weights of its distinguished atoms: 

-(^)= n 
i<i<fc 

and the weight of a finite combinatorial class is the sum of the weights of its members. In particular, 
for C„ we have: 

7r(C„) = tt{s). 

If the algorithm is such that 

m = ^ \fs€c„, (10) 

7r(C„) 

then the larger the weight of any given atom is (with regard to the weights of the other ones) , the 
more this atom occurs in a random sample. On the other hand, formula (jlOp implies conditions ([T]) 
and dSl). 

Now we have to solve two problems: 

1. Find a fc-tuple tt that satisfies (|5]) assuming that (ITU)) holds; 

2. Design a generation algorithm which satisfies ([TU]). 

Let us first solve the latter, for which we adapt the recursive scheme. 



Proposition 1 Suppose that tt is given. Then an adaptation of the recursive approach gives an 
algorithm which takes 0{ti^^°^^^ + rmi log n) arithmetic operations for generating m structures of 
size n such that each structure s is generated with probability P(s). 

In order to generate words with the required distribution ()10|) , we use the methodology presented 
in Section [21 with just a slight change: Now the rule 

C ^ Zi Cl = 7r(Z,;) = TTi. 

replaces rule Q in Table [1] The generation process then works exactly like the uniform one 
described in Section [2] It can be easily shown that the probability of generating a structure s 
occurs will be proportional to its weight tt{s). 

The 0{n\ogn) behavior of a Boustrophedon search follows from the facts that: i) The worst- 
case complexity of the uniform generation is in 0{n log(n)), as was shown in Q; ii) For any sampled 
structure s, the costs of generating s in the weighted and uniform distribution are strictly identical. 
Since the generation cost of any structure is in C'(nlog(n)), then so is the expected cost of a 
generation, regardless of the distribution. 

From now on, given C, tt and n, let us write fT^{Zi, C, n) for the average number of atoms Zi in 
the structures of C„ generated by the above scheme. Our problem is then the following: given the 
k-tuple {^i, . . . , /ifc), find the fc-tuple tt of weights that achieves targeted frequencies, that is such 
that 

f^^lZi, C,n) — n ■ fj,i for all i such that 1 < i < fc. 

We give two different approaches to tackle this problem. The first one, detailed in Subsec- 
tions [X^l is analytic and gives, if some conditions on C hold, asymptotic formulas for fT^{Zi,C,n) 
when n is large, assuming we are able to solve some system of functional equations. By contrast, 
our second programme, described in Subsection 13. 3[ leads to an heuristic for approximating tt in 
the general case. 
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3.2. Computing weights suitable for asymptotical frequencies 
3.2.1. The (non-rational) context-free case 

A combinatorial class is said to be context-free if it can be specified without using set and cycle 
operations. A result of Drmota jl^, applied by Denise et al 17] to the case of weighted context-free 
grammars allows us to foresee a symbolic approach to the computation of weights compatible with 
expected frequencies. More specifically, it defines sufficient conditions such that the number c„ of 
structures of size n asymptotically follows the ubiquitous behavior 



and such that the coefficients c\ that count the total number of symbols in all words of size n 
follow asymptotic expansions of the form 

cl^ n^, -^{l + 0{llM) 

for and K-^^i some explicit constants of n. It follows that a relationship exists between the 
weights and the asymptotical frequencies of occurrence for each atom Zi. This relationship is in 
most cases quite simple, and allows to derive suitable weights tt for reasonable objective k-tuples 
of frequencies (/xi, . . . , /ifc). 

Definition 2 (Simple type specification) Let '3/ = be a set of standard specifications for 

a tuple C of algebraic (context-free) combinatorial classes. 

Let Cni,...,nk,r bc the number of structures of size n = ?' + X]i=i ^ combinatorial class C, having 

Uj occurrences of atom Zj , j € [1 , fc] , and r remaining atoms. 

Then ^' is said to be of simple type if there exists, for each combinatorial class C G C, a k- 
dimensional cone Mi C M*^ that is centered on and saturated such that 

V(ni, . . . , nfe, r) e M n N^-+\ ^ 0. 

Theorem 3 (Asymptotics of algebraic specifications [16]) Let ^ = be a combina- 

torial specification for a m-tuple C — (Ci, . . . , Cm) of combinatorial classes such that: 

1. for any i G [1,™], Ci is not isomorphic to a rational language. 

2. 4* doesn't use any e-production. 

3. is a simple type specification. 

4. is strongly connected. 

For each i G [1, fc] o-nd j G [1, m]: 

- Let Ui be a random complex variable and ni a real valued weight. 

- Let Sj be the multivariate generating function for class Cj . 

- Let ^j{t,ui, . . . ,u\z\, Si, . . . , Sm) be the term obtained from "ii j by replacing Zi byt-'Ki-Ui, 
and Cj by Sj . 



Finally, let A be the Jacobian matrix of ^, such that A = 
Consider the following system: 



ije[i,l*|] 

Slit-KiUi, tTT\z\U\Z\) = Ml, • ■ • , U\z\-,Sl, S\^\) 



S'l*|(i7riui, . . . ,i7r|^|M|^l) = ^\^\{t,ui, . . . ,u\z\, Si, . . . , S\m\) ^^'^^ 
= det(I - A) 

Let {p*^. Si, . . . , 5*1*^1) be a \^\ + 1-tuple of functions of u = {ui, . . . , u\z\), solution of System (jlip 
such that G K+ and is minimal. Then we have: 

/.(Z„C,n) = -^|^(l).n + 0(l) (12) 
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Figure 1: Two equivalent grammars for the Motzkin language along with their dependency graphs. 



The intuition behind the conditions of this theorem is the following: 

- The non-rationality of the corresponding language helps avoiding simple poles, a case where 
the simplifications presented in section [3.2.21 appear. 

- The strongly connected condition ensures that the dominant singularity is the same for all 
functions Si(t, . . . ,t). 

- Furthermore, adding a simple type condition guarantees a square-root type dominant singu- 
larities for all generating functions Si. 

- The value a;^(l, . . . , 1) is the dominant singularity, necessarily positive as we are considering 
series with positive coefficient (Pringsheim's Theorem). 



Remark 4 The original formulation of the Theorem [l6| addresses a wider range of candidate 
systems (fTTj) than the context-free languages, thus it is expected that some of its most stringent 
constraints can sometimes be relaxed. For instance, the coefficients of the equations derived from ^I* 
are positive, which is a real restriction since the class of context-free languages is not closed under 
complement. 

Also, the e-free condition can be relaxed, since it is a classic result that any grammar can be 
transformed into an e-free one generating the same language. 

Lastly, a property that might be too stringent is the strong- connectedness, whose role is to avoid 
some complicated cases where several concurrent singularities may interfere, e. g. giving rise to 
oscillating asymptotic behaviors. Indeed, many concrete examples show that, as can be verified 
through singularity analysis 18|, correct frequencies can be predicted by mean of the theorem 
although their graphs are not strongly connected. 

Some of these examples are purely artefactual, a phenomenon illustrated by the two grammars 
from Figure [TJ In this example, the two grammars have different dependency graphs, and grammar 
G trivially does not meet the strong-connectedness criteria of theorem |31 despite generating the 
same combinatorial class. One can even build classes of languages such that the conclusions of 
theorem |3] applies, whereas the language cannot be generated by any strongly-connected grammar. 
For instance, one may consider all sorts of fc-ary trees whose leaves are sequences of a dedicated 
axiom. 

Therefore it remains to propose a tighter characterization of eligible specifications, not necessar- 
ily based on the structure of the system (not sufficiently informative) or on properties of associated 
generating functions (solving some of these systems may be challenging) but rather on intrinsic 
properties of the associated combinatorial classes. Such a characterization remains a challenging 
problem at the moment. 



Figure 2: Convergence toward the asymptotic regimes (Dashed lines) of the proportions fc (Solid lines) of unary 
nodes among 7r-weighted unary/binary trees of size n. Five values for the couple (n, fc) are shown here (From top 
to bottom): (10,5/6), (2,1/2), (1,1/3), (1/2,1/5), and (1/10,1/21). 
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Example 1 (Motzkin words/Unary-binary trees) 

Motzkin words are the easiest and most ubiquitous representant of the context-free class of languages 
for which two atoms can occur independently. They are also known to be in bijection with the rooted 
trees having nodes of degrees 1 and 2. They are generated by the following context-free grammar: 

S^aSbS\cS\e 

Through weighting the terminal letter c with a real-valued weight tt and marking the terminal 
symbol c with a complex variable u, we get the following expression for $5^ 

S'7r(i, tu) — $5^ {t, U, Sjr) = tS-„{t, tu)tSTr{t, tu) + tUTrSTr{t, tu) + 1. (13) 

Since there is only one non-terminal (e.g. combinatorial class) 5, the Jacobian is reduced to a 1 x 1 
matrix A such that: 

A = 2fS^{t,tu) +TTUt 

and 

dct {I- A) = 1- 2t'^S^{t, tu) - nut. (14) 
Putting together equations [T51 and [Til from above yields the following system 

S7r{t,tu) — tS.n{t,tu)tS'„{t,tu) + tunS-nit^tu) + \ , . 

= l~2t^sJ{t,tu)~TTut ^ ' 



whose solutions for t are 



Taking the positive solution t"*" and applying equation ([T^ yields the following weight tt that 
achieves an asymptotic frequency fc for the terminal symbol c 

2/c 



l-/c 



It is then possible to gain full control over the asymptotic frequency for terminal letters c 
and (a, 6). Although in principle this relationship holds only for large values of n, a fairly quick 
convergence toward the asymptotic regime is observed, as can be seen in Figure [2l Also, the impact 
of the weight on this convergence, although noticeable, does not seem too drastic. Alternatively, 
the three types of atoms can be weighted with a triplet (TTa, TTf,, tTc) and the weight/frequency 
relationship remarkably simplifie^0 into TTa = T^t = fa = fb, and tTc = fc with fa-'rfb-^fc^^- 

Since these letters map respectively to unary and binary branches through the classic unary- 
binary tree bijection, we can draw random instances of weighted unary-binary trees. We get the 
typical behaviors exhibited in Figure |3] for increasing values of tt. 



Example 2 (Binary arithmetic expressions) 

Another class of structures that can be seen as a context-free language is the language of arithmetic 
expressions. We will restrict our operations to the addition and substraction and accept only 
numbers having one binary digit. This yields the following grammar, given in polish notation 
(prefix form) to avoid potential ambiguity: 

E + E E \ - E E \ N 

N ^ 0\1 



Average value of an expression: Although this problem can probably be solved exactly through 
bivariate generating function techniques, we choose a random generation approach to get a rough 



^As was pointed out by an anonymous reviewer. 



7 



idea of the influence of the number of occurrences of the + symbol over the average asymptotic 
value of an arithmetic expression. Therefore, we adjoin a weight TT-f to the atom + that will be 
used to control its frequency /_(- . Also we define the length n of a binary expression to be the length 
of its encoding, ie its number of terminal symbols. 

As shown previously, the above unambiguous context-free grammar can be translated into a 
system of functional equations. Solving the system gives the length generating functions associated 
with each non-terminal. In particular for E, we have 



- 2t{l + un+) 

with u and t respectively marking only + and any atom. 

The above generating function, after some basic singularity analysis, yields 

Unsurprisingly, it is impossible to find a weight 7r+ such that more than 50% of the symbols are 
+'s, which follows directly from the binary tree-like structure of our expressions. 

One can also adjoin a second weight tti to each occurrence of the atom 1, along with a new 
complex variable v. Solving the new system yields the following generating functions: 



l-y/l-U^{l+U7T+){l+V7T,) 

E^^^^,{t,u,v) = ^ — — 

2t(?i7r+ + 1) 

Again it is possible to link the asymptotic frequency /i (resp. /_(-) for 1 (resp. +) with both 
weights 7r+ and tti , which yields 

A remarkable property here is the absence of correlation between the frequencies of 1 and +, once 
again due to the tree-like structure of arithmetic expressions. We can then use these equations to 
estimate the average value of an arithmetic expression having different proportions of 1 and +'s. 
A random generation of 100 000 expressions for sizes ranging from 1 to 200 allows us to conjecture 
a size- independent average value when 7r+ = 1 (See Figure H]). 

Exact analysis of the — 1 case : In the 7r_|. = 1 case, it is an interesting fact that the average 
value E(y„) of an expression is in fact independent from n. More specifically, it can be shown that 

E(K) = -^,Vn> 1. 

This can be proven by induction on n, since 



1 + TTl 1 + TTl 1 -t- TTl 

and that assuming E(Vfe) = 7ri/(l + tti), < n yields 

n—1 n—1 
k>l k>l 

2^1 



k>l 

where (resp. PjT „) is the probability that an expression of size n having root + (resp. — ) is 
composed of two subexpressions having sizes k and n — k. Since 

n—1 n—1 
k>l k>l 
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and p^jj = p'l^ when 7r+ = 1, then J2k>iPkn ~ ^/"^ claimed result holds. The results 

then specializes into E{Vn) = 1/2 in the uniform (7r+ = l,7ri = 1) case, and into E(K,) = 2/3 in 
the (7r+ = l,7ri = 2), both values being conjectured from FigurelH 

3.2.2. The rational case 

In this section, we show how to compute a fc-tuple of weights that is suitable for generating 
words according to given frequencies for a non trivial class of rational languages. As we will see 
in some examples below, the result generalizes to combinatorial classes whose generating functions 
are rational. 

If C is a rational language, then its (weighted) generating function writes 

S.«,u) 



(5^(i, u) 



where u stands for Ui, . . . , u^, and where there exists r > and 5i, ... ,5k > such that P^^ and 
Qtt are analytic in the domain T> = {(i,u) : |i| < r, \ui — 1| < Si^i}. 

We establish a simple formula for the average number of occurrences of each symbol in the 
weighted distribution. Quite noticeably, this formula does not require locating all the actual singu- 
larities, a difficult task as the weights are evolving, but only involves derivatives of and the 
unique dominant singularity. 

Proposition 5 Let C be a rational language counted by a (weighted) generating function ST^{t, u) — 
PTr{t, u)/QT^(t, u) such that ST^{t, u) has a unique dominant singularity G M+. For any i G [1, fc] 
and any k-tuple tt such that iTj ^ 0, Vj G [1, fc], we have: 

UiZ,, C, n) = p^l^ZLi^^(p^)„ + 

Ctt 

where 

and is the unique real zero of smallest modulus of Q^^lt,!). 

Proof. For the sake of simplicity, we make the ubiquitous dependency on tt implicit by dropping 
it from our notations. Let a G be the multiplicity of p as the unique dominant singularity 
of S{t,l). There exists a roots (pi(u), . . . , Pq(u)) of Q{t,u) such that Vj G [l,a], Pj{l) — p. 
Furthermore there exists a polynom R{t, u) such that 

a 

Q{t,u)^R{t,u)-l[{l-t/pj{u)) (16) 

and the function P{t, 1)/R(t, 1) is analytic at t — p, where it takes a positive real value k. 

[r]^(i,l) 

As will be shown in Proposition [51 we have f{Zi, C, n) — — — and 

[t"j5(t, 1) 

du,^' ' i?(t,i)p2(i_t/p)a+i + {i-pY 

Both S{t, 1 ) a nd ^-(i, 1) are rational generating functions and a generic treatment of such 
functions (See yields the following asymptotic equivalents: 

r]S{t,l) ^ n-- +0{n"-^p-n 

[a — l)!/9" 

inf(..i) ^ ..f|;zl|<ll)^.c-(„.-v-') 
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Remark that there exists degenerate cases where the multiplicity of p as a pole is decreased (or 
cancelled) by the derivative on m. Therefore the first term of the expansion may cancel but the 
statement remains valid thanks to the O(-) notation. Taking the ratio, we obtain the following 
equivalent for f{Zi,C,n) 

/(Z.,fin)^- ^^-^^"'^ ' n + Oil). (17) 
ap 



we obtain the following derivatives of Q 



nt dpi , , , , , OR , 

na ,^ , , dR , 
+ i,l 

P OUi 



-iCi(p) _ Z]"=i§^(l) 




and in turn 



P n 



c{p) ap 

where one recognizes the first term of Equation [T71 □ 
Now consider that one is given a fc-tuple (/ii, . . . , pk) and aims at finding a fc-tuple tt such that, 

for any i, f^^^Zi, C, n) ~ npi. Let tt^ be the weight of atom Zi for any i. 

Under the assumption of a unique dominant singularity in St^{z, 1), the following algorithm can 

solve the problem numerically if such a solution exists: 

• From Q^(t, u), compute c,r(0 ^^.d the C7r,i(i)'s (for 1 < i < k) where t and the tt^'s remain 
symbolic variables. 

• Build a system of k algebraic equations: 



Q,r(p,l) = 

(18) 



P ip) = Ml 



_lC7r,fe(p). ^ 

p — — (p) = Pk 

Ctt 

in the unknown variables p, tti , . . . , tt^ . 

Solve the system using numerical techniques (using FGb j 2Qj for example) 
• Among the solutions, take one for which p is real and has the smallest modulus. 

Remark 6 The prerequisite of Proposition [5] (uniqueness of dominant singularity) is satisfied by 
specifications associated with strongly connected, aperiodic automata, where the dominant singu- 
larity is known to be unique and has multiplicity 1 (See [Til Theorem IX-9, p656]). Such a property 
also holds for any specification whose strongly-connected components are aperiodic in the sense 
that, internally to each component, the greatest common divisor of all cycle length is 1 (Easily 
proved by induction). 

Remark 7 In the case of multiple dominant singularities, corresponding to periodic automata. 
Proposition [5] may fail. However it is worth mentioning that, using partial knowledge of the tar- 
geted length n, one can transform any rational specification into an equivalent one meeting the 
requirement of Proposition [5l 

Let C be a rational specification and C,-^d its restriction to objects of any size n' such that 
n' = r [D], respectively counted by 

fe fe 

S{t, U) = Y,Y. ^"'i ^" n ^^'D{t, U) = ^ ^ SND+r,i n "/^ ■ 

n>0 i>0 j=l N>0 i>0 j=l 
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Notice that, in order to avoid trivial periodicities in Sr.D(t,u), N is no longer the size of counted 
objects but rather the number of periods. 

We rely on the fact that, in any rational generating functions with positive coefficients (See [19^, 
Theorem V-3, p302]), there exists a modulus D E N+ such that, for any base r E [0,D—1], Sr.oit, 1) 
has a unique dominant singularity on the positive real axis. Since any dominant singularity pj is 

such that {pj/\pj\) = e "i where pj E N, qj E and gcA{pj,qj) — 1 (See [19|, Theorem IV-3, 
p267]), then a suitable value for D will be the least common multiple of all qj^s. 

Then a specification Cr,D counted by Sr.oit, u) can always be built from an automaton for C. 
In short, one starts by intersecting C with the language denoted by a rational expression m^.D 
generating all objects of size n' such that n' = r [D], given by 

mr,D = (Zi + . . . + Z\z\Y{{Z^ + ... + Z\z\fy. 

The minimal automaton for the intersection language (rational and constructible) only has cycles 
of lengths that are multiple of D. Sr,Dit,u) can then be obtained, either by only marking with 
the size variable t the atoms occurring at position p such that p = r + 1 [D] , or through a variable 
substitution in the resulting generating function. 

Finally, Proposition [5] applies to Sr,Dit,u) such that the weights tt and the average proportion 
Pi of an atom Zi are interrelated through p~^^^{p) = Dpi. Refiecting this slight modification into 
System [18] and solving the system gives suitable weights for large values of n such that n = r [D] . 

Example 3 (The Fibonacci language.) 

The simple and well known Fibonacci language is defined by the regular expression (a + bb)*, 
and admits a strongly connected aperiodic automaton. Suppose we want to generate words while 
biasing the average number of a's. We thus put a weight tt^ on the letter a. The weighted generating 
function writes: ^ 

1 - TTaUj - 

SO QtTo {t, Ua, Ub) = 1 — TTaUat — u'^t'^ . Wc havC 

C^^,a(t,Ua,Ub) = -TTat and Cr,{t, Ua, Ub) ^ -TTaUa - 2ult, 

which leads to 

-1 -T^aP 



I-Ka (a, S,n) ^ p 



-tTq - 2p 
n. 



Now let Pa be the desired asymptotic proportion of a's in the generated words, we just have to 
solve 



which gives 








and p 



1 - Mq 



This gives, for example, tTq — 2/\/3 sa 1.1547 (and p = l/VS ~ 0.577) in order to reach 
Pa = 0.5, that is an asymptotically equal proportion of a's and 6's in random Fibonacci words. 
Note that, in the uniform generation scheme (that is ttq = 1), we get pa — ^ ~ 0.447. Finally, it 
is worth mentioning that adding a weight nbt on each occurrence of bb leads to the simplification 
tTq — 2 Pa /{I + Pa) and TTbb = 1 — ""a- Figurc [5] shows random weighted Fibonacci words for different 
values of ttq. 
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Example 4 (Motifs in random sequences) 

We consider here the number of occurrences of a given motif in a random sequence. This is a 



classical issue in bioinformatics. Our approach follows, in some sense, the one in [2l|, though for a 
different purpose. Our example is the following: we want to fix the average number of occurrences 
of the motif aug in a random RNA sequence, that is a sequence on the alphabet {a, c, g, u}. In order 
to distinguish the aitg's, we mark the last g, replacing it with g. Hence, in fact we consider words 
on {a, c, g, g, u\ where there is no occurrence of uag and where every occurrence of g is immediately 
preceded by ua. Obviously, counting the aug^s in this language is equivalent to counting the aug^s 
in {a, c, 5, u}* . And, in order to generate words in the suitable alphabet, we will just have to replace 
each letter g with a letter g during the random generation process. 

Our language can be represented by the (strongly connected and aperiodic) deterministic finite 
automaton of Figure [6] or, equivalently, by the following non-ambiguous regular grammar: 

S'o e\aSi\cSo\g So\u So 

51 -> e \ a Si \ c So \ g So \ u S2 

52 — > e \ a Si \ c So \ g So \ u So 



Now by putting a weight tt^ on g, we are able to tune the number of occurrences of the motif. 
Namely we have: 

S^{t,a,c,g,g,u) = 



thus 

which gives 
and 

Hence we find 



1 — t{a + c + g + u) + t'^aug — iTgt^aug ' 
Q-jrit, a, c, g,g,u) — 1 — t{a + c + g + u) + t^aug — ngt^aug 

c^g,git,a,c,g,g,u) = -TTgt^ua 
c^g (i, a, c, 5, u) = —{a + c + g + u) + 3t'^uag — 3TTgt'^uag 



f^{g,C,n 



J 

4 - 3/92 + 3^_^2 



where p satisfies the equation (3^(p, 1, 1, 1, 1, 1) = 0. Thus we have to solve the system 

l-4p+(l-7rg)p3 = 

4-3p2 + 3^gp2 Aig- 

in order to find the suitable value of TTg that gives the desired asymptotic ratio Hg of motifs atg in 
the words to be generated. For example, setting /ig = 0.1 gives tt^ « 11.148 and setting fig — 0.01 
gives TTg « 0.621. Note that, in the uniform generation scheme (that is TTg = 1), we would have 
= ^ « 0.016. 

Let us take an additional parameter into account. We aim to fix the (joint) proportion of letters 
a and u in the sequences, which is called the "a + u content" in bioinformatics. This is a natural 
issue in bioinformatics, where the observed frequencies of nucleotides have to be taken into account. 
To this purpose, let us replace each letter a or u with a new letter a, and let us put the weight tTq, 
on this letter. We get 

Q^{t,c,g,g,a) = 1 - t{2TTaa + c + g) + Trlt^a^g - TTgTTlt'^a'^g 
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then 



and 



"{^^,^s)i'tiC,g,g,a) = - {2Tr + c + g) + Snlt'^a'^g - STrlirgt'^a'^t 



Hence 

U{g,C, n) 

and 



Now, adjusting the a + u content and the number of motifs atg reduces to solve a system of three 
algebraic equations in tt^, tt^, and p: 



{ l-2p(l + 7r„)+p3^2(i_^_) ^ 



2 + - 37r2p2 + 37r2 7rgp2 

27rQ(l - -KaP^ + -naT^gP ) 



^ 2 + 27rQ - 37r2p2 _|_ 37^2^_p2 



Mo 



For example, setting p^ — 0.7 and /ig 0.1 gives tTq « 2.475 and Tig ~ 9.430 (with p ss 0.128). 
Example 5 (RNA multiple stem-loops) 

Here we show that Proposition [5] can be sometimes apply in some cases where the language is not 
rational. At first, let us consider the following language : L — {a^d^U^ : m,n > 0}. In molecular 



biology, this represents what is called a stem-loop in a RNA secondary structure (see [22] or [23[ for 
details). Roughly, a's and 6's represent paired nucleotides (in the stem), while c's represent unpaired 
ones (in the loop). Now let us define the language L' = d*{Ld*)* . that is the language consisting 
in series of stem-loops, where each two consecutive stem-loops are possibly separated by stretches 
of unpaired nucleotides, represented by d's. Obviously L and L' are not rational languages, but 
their generating function are rational. Indeed, there is a straightforward one-to-one correspondence 
between the words of L' and the words of the rational language d*{{ab)'^c'^d*)*. Additionally, the 
minimal automaton of this language is aperiodic and strongly connected, thus Proposition [5] holds. 

Suppose we aim to generate words of L' while fixing the average number of stem-loops and the 
average number of paired nucleotides. For the latter, it suffices to put a weight tTq on each letter 
a. As regards the number of stem-loops, let us distinguish one letter in each loop (for example the 
last one) by changing the c to c. Now our language obeys the following grammar: 

S DT S\D 

T aTb\aCb 
C cC\c 

D -> dDle 



The weighted generating function is 
S-n-{a, b, c, d) = 



1 — tc — TTat^ab + TTat^abc 



1 — t{c + d) — t'^{TTaab — cd) — TTat^iT^sO-bc — ttbc — abd) — iTat'^abcd 
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Finally we find the following system: 

1 - 2p + (1 - Tra)/?^ + (27ra - 7ra7re)p3 _ 7r^p4 

, 2 + 2p(^a-l)+3p2^„(7rE-2)+ VTTa 

It can be solved symbolically, leading to 

_ 1 - 2^a - 

^ 1 - 2^a + Aig 

_ if^a - A*g)(l - 2pa + p,cf 

Pa{^ - 2pa - tJ'cY 
^ 

' " (^Q - ^e)(l - 2/^Q - /xg)(l - 2/^Q + /xg) 

Note that we must have 2pa + /^e < 1 since there are as many &'s as a's in the words to be 
generated, and room must be left too for c's and d's. For example, setting p,a = 0.4 (for 80% of 
paired nucleotides in average) and /ig — 0.1 (for n/lQ stem-loops in average in a structure of size 
n) gives tTq — 27/4 and TTg = 4/9 (with p = 1/3). 



= 

= Ma 
= Pc 



3.3. Computing weights for fixed lengths: An heuristic approach. 

Now we address the problem of finding suitable weights for expected frequencies in its most 
general setting. Indeed, it is not always possible to apply purely analytic methods such a the ones 
described in Section |3.2[ or even only to compute explicitly the generating function. By contrast, 
it is always possible to translate an unambiguous context-free grammar into a recurrence equation, 
which allows for an exact evaluation of the numbers of words in the grammar. Applying this 
method to the weighted context-free languages gives an algorithm, described in Subsection 13.3.11 
for computing the frequencies associated with given weights. From this, we can use a continuous 
optimization algorithm described in Subsection l3.3.21 to obtain a precise approximation of suitable 
weights. 



3.3.1. Preliminary: Computing frequencies from weights 
Let us consider the following generating function: 

5^(^,u) = ^^(s)^Hul^l"^..4^l"^ 

sec 

where u = (iti, . . . ,Uk). We can write 

n,ji,...,jk>0 

where TTn,ji,...,jk stands for the sum of weights of the structures of size n having ji occurrences of 
atom Zi for alH = 1, . . . , fc. The following result holds: 



Proposition 8 Let fT^{Zi,C,n), be the expected number of occurrences of Zi in the structures of 
Cn generated by the algorithm. We have: 

Proof. This is a standard result. By definition, we have 

tt{s) 



fAZ„C,n)= J2 \^\zMs)^ J2 1^1^. 
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from P(s) — -^^^ by Formula (jlOl) . The numerator is obtained from 



while the denominator arises from 



r(C„)= ^ ^„,,„....,, 



□ 



This resuh ahows to compute j-^{2,i,C^n) from the generating functions S'7r(i,u). However, 
computing the partial derivatives requires a closed- form expression of the generating function S'tt, 
which can be hard to obtain for complex grammars. Therefore for practical applications, we propose 
a different approach based on recurrence formulae. 

Proposition 9 The frequencies f-n-{Zi,C,n) associated with all Zi's can be computed in 0{n'^) 
arithmetic operations. Moreover, if C uses only the product and union constructs ( context-free 
language) , then there exists a ©(n^) arithmetic operations algorithm for computing the f^^lZi, C, n). 

We define gT^[Zi, C, n, m) to be the sum of weights for all structures in C„ featuring m occurrences 
of Zi. Then we have: 

{Tr{Zi) = TTi if i — J, n = 1 and m — 1 
Tr{Zj) = TTj if i ^ j, n ~ 1 and m = 
otherwise 
> gT,(Zi,C,n,m) = A, n,m) + g^{Zi,B,n,m) 

n — l rn 

' gn{Zi, C,n, = ^ Y^gT^{Z^,A,a, b) . g-^iZi, B,n - a,m - b) 
' g-Ki2i,C,n,m) ^ n . g^{Zi,A,n,m) 



c = 


2j 


c = 


A + B 


c = 


Ax B 


c = 


OA 



and then in turn 



J2m=o97^iZ^,C,n,m) 

These recurrence relations lead to an algorithm, which needs to compute a table of the values for 
each gT^{Zi, C, n, m). Its size is 0{n^), and each entry needs, at worst, 0{n'^) arithmetic operations. 
Thus the overall worst-case complexity for computing the expected number of occurrences of any 
atom Zi in a structure of size n is 0{n'^). 

An alternative way for computing these frequencies in context free grammar specifications is 
based on a generalization of the grammar transform associated with the pointing operator (See [2,] for 
examples). Namely, we introduce a partial pointing operator which duplicates objects by marking 
any occurrences of a given atom. For context-free languages, we show how to adapt a specification 
for the partially-pointed language from the input grammar. Extracting coefficients from the result- 
ing grammars gives us both the numerators and denominator of equation [11] at the usual cost of 
coefficient extractions, effectively improving on the complexity of the previous method. 

Let us first define the partial pointing operator , taking a class C and returning a class C*' 
whose members are obtained from a member of C by pointing an occurrence of Zi. Consequently 
any object a € C gives rise to a number of objects in C that is equal to its number of occurrences 
of Zi, and the ordinary generating function of C** is therefore clearly ^f-. 

Based on the obvious combinatorial interpretation of the partial pointing operator, it is possible 
to build a grammar C/** for partially pointed language from the rules of an initial context free 
grammar Q. Generalizing from the rules used for the general pointing operator we obtain 

C ^ A\B C" ^ A'^ I B'^ 

C^A-xB C" ^ A'' ■ B \ A- B'' 



C^Zj ^ C 



Otherwise. 
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The symbol tags as non-productive a non-terminal C, which can be eliminated through an iterated 
post-treatment. However non-necessary, this may decrease the constants involved in the complexity 
of this approach, since the complexity of our enumeration algorithm depends, in a somewhat hidden 
fashion, on the number of non-terminals. 

Using counting rules from Table [TJ we can then evaluate the number g** of words of size n in 
Since the generating function S'**(i, u) of Q'^ is such that S'^{t, m) = m- ^"^g^*'"^ , then we have 

The expression of Proposition [5] for can then be rephrased as follows : 

Qn 

Since both 5** and gn are numbers (resp. total weights in weighted specifications) of words 
in a context-free grammar, they can be computed in 0{n?) arithmetic operations and in 0(n'^) 
space complexity and so can fT^{Zi,Q,n). These can be lowered to 0{n) arithmetic operations 
and Q{n?) space complexity by using the linear recurrences obtained for any grammar by symbolic 



methods (GFun [2j|). Although this approach could in principle be adapted to general standard 
specifications, it is unclear at the moment how some of the partial/general pointing/unpointing 
combinations may interact, and we favored the former approach in our implementation despite its 
higher theoretical complexity. 



3.3.2. Assessing suitable weights through an optimization heuristic 

Remember we want to find a fc-tuple of weights tt = iT^i)ieli,k] that achieves targeted fre- 
quencies (/ii, . . . , fik) associated with our k distinguished atoms (Zi, . . . , Zk). To that purpose, we 
reformulate our problem as an optimization one. 

Let $ : M''" X N M'' be the function that takes a fc-tuple of weights tt = (tti, . . . , TTfe) and a 
length n G N, and returns the fc'-tuple of frequencies observed among words of length 

n. We described in Section 13.3.11 two methods to compute the function $ which, in addition to an 
expected smoothness of the function $, allows us to foresee an efficient optimization approach for 
the inversion of $. More specifically, we want to find weights that achieves targeted frequencies 
M = if^i)ie[i.k] ■ To that purpose we reformulate our problem as an optimization problem by defining 
an objective function : M'"' x N — > R such that 




We point out the fact that 

(F(7ri*,...,<.,n)=0) ^ ($K,...,7r^,n) = (Mi,...,Mfc)) 

so that solving the former yields a solution for the latter. Also, it is worth noticing that, thanks to 
the partial pointing described above, F can be computed in 0(fc • n) arithmetic operations. 

CONDOR is a continuous optimization algorithm, developed and implemented by Vanden Berghen 
et al 25]. It attempts at finding the values for a set of parameters that minimizes an objective 
function. It proceeds by building a local approximation of F around a given point, as a polynomial 
of degree two and uses it to perform an analog of a steepest descent while maintaining a trust 
regions. We used a C-|--t- implementation of the CONDOR algorithm, downloaded from F. Vanden 
Berghen's website. We implemented the partial pointing algorithm described in Section 13.3.11 for 
the computation of $, using the C++ arbitrary precision library apf loat created by M. Tommila. 
We combined these three components into a software GRGFreqs, which takes as input a grammar 
formatted as a GenRGenS description file with additional target frequencies for the terminal 
symbols, and iteratively finds a set of weights that achieves such frequencies. 
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By contrast to the analytic approach, which rehes on the assumption that the asymptotic regime 
has been reached, this approach works for fixed, potentially small, values of n. Moreover it is fully 
automated and does not require any interaction with a computer algebra system. This allows for 
a computation of suitable weights, even for complex grammars for which solving the associated 
systems of functional equations by computer algebra is challenging. Finally it is also possible to use 



sophisticated methods inspired by 17[ to achieve exact values for F, or just to take advantage of 
the numerical stability of our algorithm and set the precision of the mantissa to a large fixed value. 
Since the CONDOR algorithm uses real numbers internally, this allows for a reasonably accurate 
computation of suitable weights, as illustrated by the following application. 

Remark 10 As pointed out by one of the referees, one can bound the error made on targeted 
frequencies when using fixed-precision reals for computing the weights. Let n^, ■■■,'tI be the exact 
solution, i.e. a set of weights that generates the atoms with the targeted probabilities fii,...,^k- 
Now suppose that floating point approximations tti, ...,7rfc are used instead of exact weights, then 
one can define the relative errors Si as tt^ = (1 + Si)TT* . Consider the maximal and minimal relative 
errors Mg = maxi(ej) and to^ — mini(ei), then one has 

(l + m,)"^*(s) < 7r{s)=TT*{s) ■ l[{l + s,p^^ < {I + Mj'n* {s) 

l<i<k 

and similar bounds hold for 7r(C„) the cumulated weights of structures of size n. By construction, 
each structure is generated with probability P(s) = J{c\ therefore we have 



(l/q) ■ P*(s) < P(s) < q ■ P*(s), with q := 



1 + 



Let us now use floating point arithmetics with a binary mantissa of a given fixed size b. Assuming 
that the method converges toward the closest expressible approximation of tt*, one has = —2^"'' 
and = 2^~^ . One can then compute a precision b such that the sampling probability P(s) for 
any structure deviates from the targeted one P*(s) by less than some e E [0, 1[; 

(1 - e) • P*(s) < P(s) < (1 + e) • P*(s). 

It can be easily shown that q < 1+e implies 1/q > 1 — e,so we are left to find a precision b such 

that 

Applying the natural logarithm on both sides, one obtains 

n (log(l + 2'-") - log(l - 2'-")) < log(l + e) 
Taylor expansions can be used for both logarithms, simplifying into 

log(l + X)- log(l - X) = 2X + X - Y] — <3X, VO < X < 1/2. 

k>i '^^ + ^ 

Here X = 2^~^ and the X <\/2 condition holds for any 5 > 2, so any b such that 

^ ^ log 3 + log(n) - loglog(l + e) 
log 2 

will achieve a relative error less than e. 

Future directions for this research will aim at replacing the current optimization scheme with a 
numerical iteration, following the pioneering work of Pivoteau et al 27] for computing the so-called 
Boltzmann oracle. 
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3.3.3. Application 1: Altering the node degree distribution for quadtrees 

Quadtrees are data structures, mostly used in computer graphics to partition the view plane, 
thus helping in determining which parts are obfuscated, or which geometrical objects are in collision. 
Considered as a combinatorial object, a quadtree can be recursively defined as either an empty tree, 
or a tree having four children, denoted by their orientations (Northern-eastern, southern-eastern, 
southern- western and northern- western) . This definition gives rise to the following context-free 
grammar 

S^aSbScSdS\e 

which generates all quadtrees through an encoding similar to that of Dyck words for binary trees. 
More specifically, it can be shown that the number of words of length An generated by this grammar 
is exactly the number of quadtrees having n internal nodes. 

Now, we defines the degree of a node to be the number of its non-empty children. 

The grammar above can then be altered in such a way that each production will create a node 
of known degree i, marked by an occurrence of a distinctive letter af. 

S ^ T\e 

T OiT bT cT dT 

a'ibTcTdT\azTbcTdT\as.TbTcdT\as.TbTcTd 
a2bcTdT\a2bTcdT\a2bTcTd\a2TbcdT 
a2TbcTd\a2TbTcd 
aiTbcd\aibTcd\aibcTd\aibcdT 
ao b c d 

Computing the proportions of symbols {ao, . . . , 04}, which can be done for instance by one of the 
algorithms from Subsection I3.3.ip . yields the distribution of node degrees for increasing lengths 
plotted in Figure |8l This distribution shows uneven proportions of each types of nodes. 

Assume we want to draw quadtrees at random in a weighted model, chosen such that the 
proportions of nodes of degree 1, 2, 3 and 4 are equal, while leaving out nodes of degree as a 
necessary degree of freedom. Furthermore, we want to make sure that there exists a quadtree that 
achieves the target frequencies. Let {ng, . . . ,^14} be the numbers of nodes of respective degrees 
{0, . . . , 4} in a quadtree, then our quadtrees must obey the following constraints: 

• The number of nodes n in any tree is related to the sum of degrees. 

• The numbers Ui of nodes of different degrees have to sum to n. 

• Nodes having degrees 1 to 4 have to be equally represented. 
These constraints translate into the following system 

Ouq + Ini + 2n2 + Sna + An^ — n — 1 

no + ni + n2 + n^ + — n 
ni — n2 = = = k 

Solving the system yields the following values in no and k: 

_ 3n+2 

no - -5- 
— 10 

A corollary is that our set of constraints can only be fulfilled by trees of size equal to 1 modulo 10. 

For instance, any quadtree of size 201 that meets the three conditions above will necessarily 
contain 121 nodes of degree and 20 nodes of each other degree. Figure [5]-Left illustrates a run 
of our software GrgFreqs using such proportions as target (121/201 for nodes of degree and 
20/201 otherwise). After about 100 evaluation of the objective function, a fc-tuple tt of candidate 
weights for symbols at, giving rise to a value 3.6 10~^ for the objective function, was found. From 
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Remark [TOl the weights can be safely truncated to 6 decimal digits to ensure a 10 precision in 
each frequency, thus we obtain 

Letter oq Oi 02 03 04 

Weight 7r(a,) 1.0 0.0711964 0.0819891 0.212971 1.47891 

Frequency /* (%) 60.19949 9.94975 9.95000 9.95024 9.95049 

Using these weights, it is then possible to replot the average frequencies for these symbols for 
sizes between 1 and 100 (Figure l^l-Right). The modification of the average profile resulting from 
adding such weights is illustrated by random instances drawn in Figure 1101 

Finally, as pointed out by one of the referees, there also exists a simple and efficient ad hoc 
way to generate quadtrees that obeys to an exact degree distribution. This can be done through 
a well-known bijection between the set of trees having nodes of degree less than a given k and the 
Lukasiewicz language on the alphabet {ao,ai, . . . ,afc} 128] . The letter in the Lukaciewicz word 
corresponds to a node of degree i in the left to right depth-first traversal of the tree. For adapting 
this bijection to quadtrees, we set fc = 4, and each letter must be colored to differentiate the 
children's positions of a node. For example, there will be 6 different colors for 02 since there are 6 
ways to choose two leaves within the four possible nodes. Thus, to generate a tree with the node 
degree distribution (no, jt-i, n2, 71.3, 71.4), it suffices to generate a random word with no occurrences 
of the ao symbol, ni symbols ai (with 4 possible colors) , 71,2 symbols 02 (6 colors), symbol 03 
(4 colors), 714 symbol 04; Then use the Cyclic Lemma [29[ to change this word into a Lukaciewicz 
word, which corresponds to a quadtree, and finally build the quadtree for a total 0(71) complexity. 

3.3.4- Application 2: Realistic RNA secondary structures 



Features of a realistic modeL The combinatorial properties of RNA structures have been 
thoroughly studied 22, 23, 30, ml, [H [li. The asymptotical analysis of the uniform model |3Q(, 



|34| shows striking dissimilarities between the structural features of the uniform model and those 
experimentally observed. By structural features, one understands: 

• Proportions of paired and unpaired bases 

• Numbers and average size of hairpin, bulge, interior, and terminal loops 

Figure [11] (upper-left) illustrates the principle of a loop decomposition, underlying the so-called 
Turner model of energy [35| . We show how weighted grammars provide in such a case with an 
elegant way to build a model that captures observed properties. 



Annotation of existing structures. First, we evaluate our features on a database of known 
RNA secondary structures [36], previously used to benchmark thermodynamics based approaches 
for the ab-initio folding problem. To that purpose, we annotate these secondary structures as 
follows: 

- Replace each base with a character depending on the type of loop it belongs to: Hairpin (h). 
Bulges (b). Terminal loops (t). Interior loops (i) or Multiple loops (m). 

- Bold characters (h, b, t, i, and m) are used for the first element of each loop. 

The result of this process is illustrated by Figure [TT] Through a carefully designed recursive scheme, 
this operation can be performed in linear time. We get the following frequencies for each characters 
among the whole database of secondary structures: 



Feature 


bbi immt t hh 


Target freq. (%) 


1.5 2.3 1.9 11.2 1.1 9.0 2.6 16.6 4.8 48.9 
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Structural features of the uniform model. Then, we use a general grammar, independently 
proposed by one of the authors [sJl and M. Nebel 37 1, from which these features can be distin- 
guished: 



S T \ H \ B H \ H B \i I H I i \ M \ e 

T tt^^^lTt 
B h\Bh 
I e I /i 

H h h 

H' hH' h\T \ B H \ H B \ il H I i \ M 

M H M \ mM" H M' 

I m M" H M" H M" 
I HxaM" H M" 
I H HmM" 
M' M" H M' 

M" H M" H M" 
M" M" m I e 

This grammar ensures that at least r unpaired bases are found in each terminal loop. Additionally, 
this grammar requires at least one unpaired base to be found in each multiple loop, since we need 
to mark each occurrence of a multiple loop with a character m. 

A combinatorial validation for this complex grammar can be found in the following way: Set r — 
1; Replace M by M' in the right hand sides of the grammar; Translate the grammar into a system 
of functional equations on the univariate generating functions associated with each non-terminal; 
Solve the algebraic system. We obtain the generating function of RNA secondary structures as 
first counted by Waterman [22]. It is worth noticing that doing the same with r = gives the 
Motzkin numbers. Therefore we claim that the restrictions imprinted in our grammar only induce 
a controlled and biologically relevant loss of generality. 

In the rest of this study, we will focus on RNA structures having 300 nucleotides. We use 
GRGFreqs to evaluate the exact expected frequencies for each of the terminal symbols in the uniform 
model Aloj and obtain the following frequencies: 



Feature 


b 


b 


i 


i 


m 


m 


t 


t 


h 


h 


Mo (%) 


7.2 


5.6 


2.8 


7.3 


3.7 


7.6 


5.2 


14.5 


18.6 


27.5 


Target 


1.5 


2.3 


1.9 


11.2 


1.1 


9.0 


2.6 


16.6 


4.8 


48.9 



Adequate weights for hairpins. Since the optimizer complexity empirically grows quickly with 
the number of variables, we will first focus on hairpin features, for which the highest discrepancy is 
observed between the uniform model and real structures. Namely, we will build an Helix model 
A^-H, that achieves average expected lengths and frequencies for hairpins similar to that of real 
structures. We slightly alter the general grammar in order to anonymize all symbols for which we 
do not need a specific weight to be computed (b, b, i, i, m, m, t and t), replacing them with a 
generic letter u. The respective targeted frequencies (/iu, /^jj, ^J'■\^ for u, h and h arc then such that 

^in = 46.3 = 4.8 = 48.9 

We run GRGFreqs with these settings, and observe the optimization scenario from Figure [T2l (Left 
part). After only 150 evaluations of F, a candidate set of weights for u, h and h is found such 
that associated frequencies only deviate by less than e^^^ sa 1.6 10^^ from the target frequencies. 
Namely, we get 

7r« = 1.0 7rg « 3.6036391 10"^ ttT^ w 1.1359318 



20 



Using these weights, we can exactly compute the frequencies for the full set of atoms in the Helix 
model M-H- 



Features 


bbi immt t h h 


Mn (%) 


0.6 2.3 1.2 10.4 1.8 15.5 2.2 13.0 4.8 48.9 


Target 


1.5 2.3 1.9 11.2 1.1 9.0 2.6 16.6 4.8 48.9 



Adding constraints to multiple loops. From the values just above, we can see that the biggest 
divergence between the model Ai-u and real data resides in multiple loops. Since these act indirectly 
on the connectivity of the tree backbone of sampled structures, it may be useful to further constraint 
associated features (Characters m and m). Therefore we propose a loop model AAc which adds 
m and m to the constraints of the previous model helix model: 

/iu = 37.3 /im = 1-1 /^m = 9.0 /ij^ = 4.8 /ijj^ = 48.9 

Running GRGFreqs with these new settings yields a set of weights 7r£, that scores less than e~'^°-^ 
2.76 • 10~^, after about 1000 evaluations of the objective function. 

vTi^ = 1.0 Trfj w 1.138626 tt^ « 2.168521 tt^ « 3.422990 10~^ ttj^ « 1.246468 



Feature 


bbi immt t h h 


Mc (%) 


0.6 3 1.5 15.9 1.1 9.0 1.9 13.2 4.8 48.9 


Target 


1.5 2.3 1.9 11.2 1.1 9.0 2.6 16.6 4.8 48.9 



From these three models, it is possible to use our prototype to generate random structures of size 
300, draw them using the RNAPlot tool from the Vienna package [s^l and compare them visually to 
the real ones. We observe in Figure [T3l a clear progression from the messy Mo to the more realistic 
Mc- This illustrates the ability of our program to assist in the design of models for biological 
sequences and structures. 

4. Generation according to exact frequencies 

Here, given a targeted size n and a fc-tuple (ni,...,^^) of integers, our goal is to generate 
uniformly at random a structure of C„ which contains exactly Ui atoms Zi for sl\ 1 < i < k. Let r 
be the number of occurrences of undistinguished atoms in the structure: we have r — n — X^iLi 
The principle of the method that we describe here is a natural extension of the general outline given 
in Section [21 

A first general algorithm was given in jl7| by two of the authors of this article. Here we present 
an improvement of that algorithm. 

Proposition 11 The generation of m .structures of .size n = Ui + ■ ■ ■ + + r featuring exactly Ui 
occurrences of atom Zi can be performed in 0{r^Y\'^^in^ + mnfc log n) arithmetic operations for 
general specifications, or in C'C^'JliLi "-i + mn) for regular specifications. 

For any class C given as a standard specification, we write Cj-^^,,,_ji^^r for the number of structures 
of C of size n — r + Jii which contain ji atoms Zi for each i g [1, A:], and r other atoms. For 
short, we can also write cj, where j ~ (ji, . . . ,2kj r). 

Let us first outline the algorithm given in [17| . The preprocessing stage consists in computing a 
table of the Cj-^....j^^r for {0 < ji < fc] ^^'^ < r < n — J2i=o This requires computing a 

table of 6(?'rii=i entries, with the recurrences stated in Table|3l Since 6(''ni=i arithmetic 
operations are required to compute each entry, this preprocessing clearly takes time 8(r^ IliLi Ji ) 
for general specifications. For regular specifications, given using only rules of the form C — TiB, 
Ti = Zi and C = 1, only one of the entries associated with the T^'s is non-null, and the product 
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00000000000000 

TT = 1/4 fc = 11.11...% 

00000000000000 

TT = 1 <^ /c = 33.33 ... % 

0000000000000 
7r = 2 <^ /c = 50% 
0000000000000000 
TT = 18 /c = 90% 



Figure 3: Unary-binary trees associated with weighted Motzkin words of size 500, for different values of n the weight 
of unary nodes. 



C = 1 ^ co,o,...,o = 1 ; 

C = Zi => co,....o,i.o,...,o = 1 (ji = 1) 

C = A + B Cj = Oj + 6j ; 

C = AX B Cj=X; ^•J+j-J'=/jl,-J^,,r'''jl'v..,jX',r" ; 

r'+r"=r 

r'+r"=r 

C = QA 

Table 3: Counting procedures for standard specifications in the case of the random generation ax;cording to exact 
frequencies. 
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Figure 4: Average value of an arithmetic expression, computed by generating 100 000 random expression, for various 
sizes n and frequencies of symbols + and 1. 



TTa = 0.5 



7I"a = 1 


TTa = 1.1547 



TTa =2 



TTa = 10 



Figure 5: Sets of randomly generated Fibonacci words of length 100 for different values of tTq. White boxes: a's; 
grey boxes: 6's 



rule can be evaluated in 0(1) arithmetic operations, bringing the preprocessing complexity down 

toe(rnti«i)- 

Now, each step of the generation stage consists in choosing a rewriting rule of the current class. 
Suppose that, at a given step of generation of a structure having distribution j = (ji, . . . , j^, r), 
one has to choose a rewriting rule for the class C. U C = A + B, one generates a structure with 
distribution j deriving from A with probability flj/cj, or deriving from B with probability &j/cj. If 
C = Ax B, one chooses a vector h = {hi, . . . ,hk,s) with probability ah&j-h/ch- Then one generates 
a structure deriving from A having distribution h and a structure from B having distribution j — h. 

This generation stage, which has a worst-case complexity in 0(" rif=i"i)' can be improved 
drastically. Indeed, the bottleneck of the above procedure is the C = A x B case, where there are 
jij2---jk'f possible different choices. Now, let c|''^' ' ''''\ be the number of structures generated 
from C, having distribution (ji, • ■ ■ ,ife, r) and such that, for each x G [1,*], exactly hx of the 
targeted jx occurrences of atom Zx are generated from A. We have: 

hi+i<ji+i hk<jkr'<r 

Now the probability of counting hi atoms Zi in the structure from A, given that the structure 
contains hi atoms Zi, . . . , hi-i atoms Zi-i is: 



¥{hi\hi,...,h,-i) 



Ul,---,3i,---,jk,r) 



and the probability of counting hi atoms Zi in the structure from A is: 

F{hi\0) = Jiiiiiiiii!l. 

'^jl,---,jk,r 

This allows to choose the adequate decomposition hi, . . . ,hk sequentially. Since picking a suitable 
value for hi involves investigating at most jt alternatives, the overhead compared to the classic 
generation is limited to a factor 0{k). 

Hence the whole algorithm is as follows: 

1. Preprocessing stage. For any combinatorial class C in the standard specification, compute a 
table of the Cy^'^'.'.'.j^l.j^^r) for 1 < i < fc, {0 < jx < nx}xe[i,k] and {0 < hx < jx}xe[i,i]- ^his 
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Figure 6; A finite state automaton recognizing the language generated by the grammar. 





Figure 7: General principle of our heuristic approach to the problem of computing weights tt that achieve targeted 
frequencies fi. 

can be done with the same recurrences as for the previous approach. Indeed the '^^\) 
are in fact partial sums of the one involved in products, and can therefore be computed on the 
fly during the computation of coefficients Cji.... This gives a complexity in 0{r^ Y[i=i 
arithmetic operations, while requiring storage of ©(fe'^IliLi ^i) numbers. 
For regular specifications, the sums associated with product rules only have one non-null term, 
so we can add a specific counting procedure 

C = Ti X A => Cjj^^,,,ji,^r — Cji,... j-^-lj... 

which lowers the time/space complexity to 6(''ni=i "-i)- 
2. Generation stage. The C — )• 1, C — > -E,;, and C ^ A + B rules are trivially borrowed 
from [l7|. In the case of product rules, a sequential choice of h described above leads to an 
overall generation complexity in ©(rnn log n) arithmetic operations through a Boustrophedon 
investigation (See [l|) of eligible decompositions in each dimension. 

Remark 12 (Multidimensional Boustrophedon) Let us discuss the improvement observed by 
adopting a Boustrophedon order of investigation in this multidimensional scheme. We remind that, 
during the generation stage for products (x), the Boustrophedon search consists in investigating 
potential partitions of the targeted size from the edges toward the middle ((0, n),(n, 0),(1, n— 1),. . . ) 
instead of sequentially ((0, n), (1, n), . . .). In the unidimensional Boustrophedon generation [l| the 
worst case complexity f{n) of the generation follows 



/(n) = max (2min(a,6) + /(a) + /(6)) (20) 

a+b—n 

which has a O(nlogn) solution [s^l- In the multidimensional case, let c = (ci,...,Cfc) be the 
targeted k-tuple of occurrences, then the worst case complexity of our algorithm is given by 

/ k \ 

g{c, r) = max 2 min(r', r") + 2 \^ min(ai, hi) + g(a, r') + g{h, r") 

a.b.r' r'' s.t. V — ^ / 

r +r — r 

Let |x| = Y^\^i Xi, then one has 

k 

2min(r',r") +^min(ai,6i) < min(r' + |a|,r" + |b|). 

i=l 

and a straightforward induction shows that 

5(c,r) < /(|c| +r) G ©(nlogn). 

In the case of regular specifications, only binary decisions appear and the generation can be per- 
formed in Q{mn) operations. 
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Figure 8: Evolution of the node degree distribution for trees of increasing size in tlic uniform model. The asymptotic 
proportions of nodes of degree (0, 1,2,3, 4) are respectively (81/256, 27/64, 27/128, 3/64, 1/256). 

00 

Figure 9: Left: Weight optimization for weighted quadtrees of size 201. The targeted proportions arc 121/201 (resp. 
20/201) for nodes of degree (resp. 1, 2, 3 and 4). 

Right: Node degree distributions for weighted quad trees of increasing size in our weighted model. Although formally 
the computed weights only work for size 201 structures, a good approximation of the targeted distribution is already 
observed for smaller sizes. 



5. Conclusion 

In this paper, we introduced and developed a new scheme for the non-uniform, yet controlled, 
generation of combinatorial structures. First we addressed the random generation according to 
expected frequencies, motivated both by bioinformatics and computer science applications. We in- 
troduced the notion of weighted standard specification, and derived a random generation algorithm 
based on the so-called recursive approach taking O{mn\ogn + for the generation of m 

structures in the according to the weighted distribution. We showed that computing asymptotic 
weights, i. e. weights that are suitable for asymptotic targeted frequencies, can be reduced to 
solving an explicit algebraic system. For fixed sizes, we gave two distinct algorithmic approaches 
for the opposite problem, i.e. the comptitation of atom frequencies achicived by given weights, with- 
out solving any functional algebraic system. The first works for every standard specification and 
takes 0{k ■ n'^) arithmetic operations whereas the second works for context-free languages and uses 
grammar transforms to compute all frequencies in 0{k ■ n^) arithmetic operations. This allowed us 
to reformulate the problem of computing suitable weights as an optimization problem, which we 
solved in a heuristic fashion. Finally, we addressed the exact frequency generation and derived a 
recursive algorithm that generates m words having a predefined atoms distribution (ni, . . . ,nfe,r) 
in O{mn\ogn + 11^=1 ) arithmetic operations. 
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Uniformly generated quad trees 




Generation using calculated weights 



Figure 10: Typical sets of randomly generated quad trees of size 201 in the uniform model (Top) and using weights 
output by our optimizer, whose objective was to balance the numbers of nodes for each degree (Bottom). We show 
here the tree representation of quad trees in addition to the classic square one, since the latter tends to overemphasize 
nodes of low depth. 





Structure: .((((.(((..((((....)))))))..(((.(((....)))..))).)))).. 
Annotation: IHhhhMHhhBbHhhhTttthhhhhhhmmHhhlHhhTttthhhiihhhmhhhhii 

Figure 11: Different types of loops in an RNA secondary structure (Left), principles of our structure annotation 
(Right) and result of the annotation (Bottom). 





Figure 12: Minimization of the objective functions in the Helices (Left) and Loops (Right) models. A logarithmic 
scale is used for the value of the objective function (Y-axis). 




Uniform model Mq 




Helices model A4-H'- Constraints on expected number and length for hairpins. 






Loops model Mc- Constrained hairpins and multiple loops. 






Native structures: Real structures of size ± 300 excerpted from (36| . 

Figure 13: Typical random structures of size 300 in the three studied random models of increasing fitness, and in 
real structures of similar size. 
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